Randy Koliha
Isaiah Sarria
Owen Telis
John Levitt
Steam is a popular, online video game distributor with many other complementing features. Steams hosts community discussion threads, access to download community-made modifications to games sold on the platform, social media-like user profiles and friends lists that displays the users overall or recent video game activity, and sales events which are based on holidays or unique themes.
We want to analyze a dataset regarding Steam games to dive deep into the behavior of gamers on the platform.
<<<<<<< HEAD ## Packages =======
Loading any needed packages.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4
✓ tibble 3.1.6 ✓ dplyr 1.0.8
✓ tidyr 1.2.0 ✓ stringr 1.4.0
✓ readr 2.1.2 ✓ forcats 0.5.1
── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(scales)
Attaching package: ‘scales’
The following object is masked from ‘package:purrr’:
discard
The following object is masked from ‘package:readr’:
col_factor
library(directlabels)
The games.csv dataset has data collected for games on the popular game license selling platform, Steam, over the months from the years July 2012- Febuary 2021. The data set has records of 1258 games (also includes other miscellaneous pieces of software), and includes their titles, monthly player peaks, monthly average players at the same time, monthly gains/losses of players compared to the previous month, and the percentage of how closely the average players approach the peak.
# Storing dataset in `games`
games <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-16/games.csv")
#cleaning it up
games <- games %>%
mutate( number_of_month = match(games[1:83631,3], month.name) ) %>%
select(gamename,year,number_of_month,everything()) %>%
arrange(desc(year), number_of_month)
games
summary(games)
gamename year number_of_month month avg gain
Length:83631 Min. :2012 Min. : 1.000 Length:83631 Min. : 0.0 Min. :-250249.0
Class :character 1st Qu.:2016 1st Qu.: 3.000 Class :character 1st Qu.: 53.1 1st Qu.: -38.2
Mode :character Median :2018 Median : 7.000 Mode :character Median : 203.1 Median : -1.6
Mean :2017 Mean : 6.546 Mean : 2765.7 Mean : -10.3
3rd Qu.:2019 3rd Qu.:10.000 3rd Qu.: 764.0 3rd Qu.: 22.2
Max. :2021 Max. :12.000 Max. :1584886.8 Max. : 426446.1
NA's :1258
peak avg_peak_perc
Min. : 0 Length:83631
1st Qu.: 137 Class :character
Median : 500 Mode :character
Mean : 5470
3rd Qu.: 1727
Max. :3236027
# Finding the distinct number of `gamenames`
dist_g <- games %>%
distinct(gamename)
dist_g
# in alpha order
dist_g_alpha <- games %>%
distinct(gamename) %>%
arrange(gamename)
dist_g_alpha
g_payday <- games %>%
filter(gamename == "PAYDAY 2") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_payday
#makes the PAYDAY 2 dataframe
Years_Payday2 <- as.factor(g_payday$year)
ggplot(g_payday,aes(x = number_of_month,y = avg, group = year, color = Years_Payday2)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Payday 2: Average Players by Year")
#auto color pallet if + scale manual is removed
Why Payday 2 Spiked In 2017
g_pubg <- games %>%
filter(gamename == "PLAYERUNKNOWN'S BATTLEGROUNDS") %>%
filter(year %in% c(2017:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_pubg
Years_PUBG <- as.factor(g_pubg$year)
ggplot(g_pubg,aes(x = number_of_month,y = avg, group = year, color = Years_PUBG)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "PLAYERUNKNOWN'S BATTLEGROUNDS: Average Players by Year")
g_csgo <- games %>%
filter(gamename == "Counter-Strike: Global Offensive") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_csgo
Years_csgo <- as.factor(g_csgo$year)
ggplot(g_csgo,aes(x = number_of_month,y = avg, group = year, color = Years_csgo)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Counter Strike: Global Offensive: Average Players by Year")
g_gtav <- games %>%
filter(gamename == "Grand Theft Auto V") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_gtav
Years_gtav <- as.factor(g_gtav$year)
ggplot(g_gtav,aes(x = number_of_month,y = avg, group = year, color = Years_gtav)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Grand Theft Auto V: Average Players by Year")
g_dota2 <- games %>%
filter(gamename == "Dota 2") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_dota2
Years_dota2 <- as.factor(g_dota2$year)
ggplot(g_dota2,aes(x = number_of_month,y = avg, group = year, color = Years_dota2)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
scale_y_continuous(labels = comma) +
labs(title = "DOTA 2: Average Players by Year")
CSGO also was given out for free in 2020 so that shows simularites with the games given out for free over time.
This data shows that many multiplayer games seem to gain player over time and often have their peak average players occur in a period signficantly after the launch of the game.
g_fallout4 <- games %>%
filter(gamename == "Fallout 4") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_fallout4
Years_fallout4 <- as.factor(g_fallout4$year)
ggplot(g_fallout4,aes(x = number_of_month,y = avg, group = year, color = Years_fallout4)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Fallout 4: Average Players by Year")
g_farcry5 <- games %>%
filter(gamename == "Far Cry 5") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_farcry5
Years_farcry5 <- as.factor(g_farcry5$year)
ggplot(g_farcry5,aes(x = number_of_month,y = avg, group = year, color = Years_farcry5)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Farcry 5: Average Players by Year")
g_cyberpunk <- games %>%
filter(gamename == "Cyberpunk 2077") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_cyberpunk
Years_cyberpunk <- as.factor(g_cyberpunk$year)
ggplot(g_cyberpunk,aes(x = number_of_month,y = avg, group = year, color = Years_cyberpunk)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Cyberpunk 2077: Average Players by Year")
g_girl <- games %>%
filter(gamename == "Hentai Girl") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_girl
Years_girl <- as.factor(g_girl$year)
ggplot(g_girl,aes(x = number_of_month,y = avg, group = year, color = Years_girl)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Hentai Girl: Average Players by Year")
This data shows that singleplayer titles seem to have a drastic fall off in their player base following the launch of the game. This most likely occurs because once someone beats a single player game they are less likely to return to it. This can also be attributed to the fact that multiplayer/esports games are usally continually updated with new content single player game typically are less likely to receive these updates consistently.
games_seasons <- games %>%
filter(year %in% c(2012:2021)) %>%
group_by(number_of_month) %>%
summarise(avg_month_sum = sum(avg))
ggplot(data = games_seasons) +
geom_line(aes(x = number_of_month, y = avg_month_sum),color = "blue", size = 1) +
geom_point(aes(x = number_of_month, y = avg_month_sum),color = "red", size = 3)+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1)) +
scale_y_continuous(labels = comma) +
xlab("Months")+
ylab("Average Number of Players") +
labs(title = "Steam Seasonality using Avg. Players")
games_seasons
games_seasons_pk <- games %>%
filter(year %in% c(2012:2021)) %>%
group_by(number_of_month) %>%
summarise(peak_month_sum = sum(peak))
ggplot(data = games_seasons_pk) +
geom_line(aes(x = number_of_month, y = peak_month_sum),color = "blue", size = 1) +
geom_point(aes(x = number_of_month, y = peak_month_sum),color = "red", size = 3)+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1)) +
scale_y_continuous(labels = comma) +
xlab("Months")+
ylab("Peak Number of Players") +
labs(title = "Steam Seasonality using Peak Players")
games_seasons_pk
NA
NA
This data suggests that there is a seasonality to Steam’s player data. This shows that Steam has the most players around December, Janurary, and Feburary. We suspect that this occurs because of the Christmas season and people recieving money and games as gifts. The steep fall off we see in Feburary we suspect occurs because individuals have completed the games they recieved during the holiday season. We also see an increase in players during the summer months. We suspect that this occurs because individuals are out of school for their summer break, giving them more time to play video games on Steam.
games_user_growth <- games %>%
filter(year %in% c(2012:2020)) %>%
group_by(year) %>%
summarise(avg_month_all_years = sum(avg))
ggplot(data = games_user_growth) +
geom_line(aes(x = year, y = avg_month_all_years),color = "blue", size = 1) +
geom_point(aes(x = year, y = avg_month_all_years),color = "red", size = 3) +
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
scale_y_continuous(labels = comma) +
xlab("Years")+
ylab("Average Number of Players") +
labs(title = "Steam Platform Growth using Avg. Players")
games_user_growth
games_user_growth_pk <- games %>%
filter(year %in% c(2012:2020)) %>%
group_by(year) %>%
summarise(peak_month_all_years = sum(peak))
ggplot(data = games_user_growth_pk) +
geom_line(aes(x = year, y = peak_month_all_years),color = "blue", size = 1) +
geom_point(aes(x = year, y = peak_month_all_years),color = "red", size = 3) +
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
scale_y_continuous(labels = comma) +
xlab("Years")+
ylab("Peak Number of Players") +
labs(title = "Steam Platform Growth using Peak Players")
games_user_growth_pk
NA
NA
Looking at the data we can see that the Steam platform has had relatively consistent growth from 2012 to 2020. However can see the only time Steam has a drop in users was from 2018 to 2019. We suspect this to be related to the rise of other online games market places such as the EPIC Games store and the popularity of Fortnite at the time. In November of 2018 Fortnite hit a with 8.3 million concurrent players. We suspect that the popuarity of Fortnite combined with the fact that it was not on Steam is ultimately what caused the dip in players from 2018-2019.
Question 4, Randy: total number of significant titles on the steam platform every years.
g_without_2021 <- games %>%
filter(number_of_month == 12, year %in% 2012:2020)
g_added <- g_without_2021 %>%
ggplot()+
geom_bar(aes(x = year))+
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
xlab("Years")+
ylab("Number of Games") +
labs(title = "Number of Games Published on Steam")
g_added
g_added_count <- g_without_2021 %>%
count(year)
grow_diff_titles <- c(268,268,393,549,720,911,1034,1101,1165)
g_added_count$differ <- grow_diff_titles
g_added_count <- g_added_count %>%
mutate(subtracted = n - differ)
g_added_count %>%
ggplot() +
geom_line(aes(x = year, y = subtracted, group = 1), size = 1, color = "red")+
geom_point(aes(x = year, y = subtracted, group = 1), size = 3, color = "red")+
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
xlab("Years")+
ylab("Numbers of Games") +
labs(title = "Growth of Games Published to Steam by Year")
g_added_count
NA
When analyzing this data we are looking at what we consider to be relevant titles published on the Steam platform. Anyone can publish their game on Steam and as of 2020 their were nearly 50,000 titles published on the platform. The data utilized in the data set is pulled from a data set that foucses on games that are actually played. When looking at the graphs we can see that the game library on Steam has been increasing over the years however we can also see that there was a large dip in the number of relevant games published to the platform between 2016 and 2018
games %>%
filter(avg == max(avg))
PLAYERUNKNOWN’S BATTLEGROUNDS is the game with the best avg player performance.
games %>%
filter(peak == max(peak))
The game with the highest peak players in Steam history is PLAYERUNKNOWN’S BATTLEGROUNDS with 3,236,027 players.
games %>%
drop_na(gain) %>%
filter(gain == max(gain))
NA
Playersunknown Battlegrounds has had the highest peak and avg players as well as greatest gain from the prior month in all of steam history.
games %>%
drop_na(gain) %>%
filter(gain == min(gain))
This makes sense considering cyberpunk was badly reviewed after the released. The game was filled with bugs that made the game hard to enjoy so a lot of people returned the game as well.